A general class of improved population variance estimators under non-sampling errors using calibrated weights in stratified sampling

This paper proposes a new calibration estimator of the population variance under a stratified two-phase sampling design. The estimator accounts for random non-response and measurement errors, and it is applied to estimate the variance of gas turbine exhaust pressure data. The study uses additional information from two highly positively correlated auxiliary variables to develop a general class of estimators tailored to the stratified two-phase sampling scheme. The biases and mean square errors of these estimators are examined through numerical and simulation studies. Furthermore, the calibrated weights of the strata are derived. The proposed estimators outperform the natural estimator of the population variance. Finally, recommendations are made for survey statisticians intending to apply these findings to real-life problems.

Estimating the population variance is a crucial task in many practical settings, with applications in finance, healthcare, and weather forecasting. Actuaries and insurance analysts rely on population variance estimates to make well-informed decisions. In weather forecasting, understanding the variability in temperature, humidity, and other meteorological factors across locations is fundamental for precise predictions. Auxiliary variables play a pivotal role in improving the precision of estimators in sample surveys. For instance, when estimating crop yields, incorporating data on the area covered by crops can significantly enhance accuracy. Numerous studies have used auxiliary information in this way: ref. 1 studied the use of auxiliary information in estimating the finite population variance, ref. 2 developed a class of estimators using auxiliary information for estimating the finite population variance, and ref. 3 introduced a new procedure for variance estimation in simple random sampling using auxiliary information. Ref. 4 further improved the estimation of the finite population variance using dual supplementary information under stratified random sampling, while ref. 6 explored the more efficient use of auxiliary information in population variance estimation and presented a new family of estimators.
Moreover, recent research has explored variance estimation using auxiliary information through approaches such as memory-type ratio and product estimators (refs. 7, 8). These efforts aim to enhance the accuracy and reliability of population variance estimation in diverse sampling designs.
However, sample surveys often encounter practical difficulties that result in non-response or missing data, such as non-contact and refusal to cooperate. When a substantial amount of data is missing, the reliability of the resulting statistical conclusions is in doubt. Several missing-data patterns can be observed, including missing at random (MAR) and missing completely at random (MCAR). Particularly noteworthy is the MAR pattern, in which the probability of missingness, given the observed data, does not depend on the values of the unobserved data.
Various researchers have addressed the need for robust estimators in the presence of random non-response or measurement errors. Ref. 9 introduced a class of estimators using auxiliary information for estimating finite […]
1. In healthcare research, when patient surveys are conducted to evaluate the effectiveness of medical treatments, not all patients may respond, and measurement errors can occur due to self-reporting. Accurate population variance estimation in such cases is crucial to making informed decisions about treatment strategies.
2. In market research, understanding consumer preferences through surveys is essential for product development and marketing strategies. Non-response from certain demographic groups or errors in survey responses can distort the estimation of market variances, impacting business decisions.
3. In educational assessments, when evaluating the performance of schools or educational programs, student participation may vary, and measurement errors can affect the assessment outcomes. Reliable population variance estimation is vital for making informed policy decisions and improving education quality.
4. By addressing these issues across diverse fields, this framework aims to provide a reliable approach for accurately estimating population variances, thereby enhancing decision-making in real-life scenarios.
[…] among others, illustrating its application in handling vague and imprecise observations in populations or samples.
Motivated by the above discussion, the present work proposes a wide class of estimators of the population variance in two-phase sampling for a stratified population in the presence of random non-response and measurement errors in the sample data. The stratum weights are optimized using calibration procedures, which yields more accurate estimates of the population variance. The performance of the suggested class of estimators is examined through empirical and simulation studies.

Sample structure
Consider a finite population of size $N$ divided into $L$ non-overlapping strata, the $k$th of which contains $N_k$ units ($k = 1, 2, \ldots, L$). Let $Y$, $X$, and $Z$ be the study variable and the first and second auxiliary variables, respectively, and let $y_{ki}$, $x_{ki}$, and $z_{ki}$ be their values for the $i$th unit of the $k$th stratum. To estimate the population variance of the study variable $Y$, it is assumed that information on the second auxiliary variable $Z$ is readily available for all population units, so its population variance is known. Information on the first auxiliary variable $X$, however, is not available for all units of the population. It is also assumed that random non-response occurs in the sample data on the study variable $Y$ and the first auxiliary variable $X$. In the first phase, a sample $S_{n_k}$ of size $n_k$ ($k = 1, 2, \ldots, L$) is drawn from the $k$th stratum by simple random sampling without replacement and observed for the variables $y$ and $x$. Of these $n_k$ units, $n_k - r_{1k}$ respond and random non-response is observed on the remaining $r_{1k}$ units. In the second phase, another simple random sample without replacement, $S_{m_k}$, of size $m_k$ is drawn from the $n_k - r_{1k}$ respondent units; of these, $m_k - r_{2k}$ units respond and $r_{2k}$ units do not.
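The two-phase selection within a single stratum can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `draw_nonresponse_count` and `two_phase_sample` are hypothetical, and the number of non-respondents is drawn from a simple Binomial(size - 2, p) law as a stand-in that respects the range 0, ..., size - 2, not from the paper's exact non-response distribution.

```python
import random


def draw_nonresponse_count(size, p, rng):
    """Number of non-respondents in a sample of `size` units.

    Simplifying assumption (not the paper's model): Binomial(size - 2, p),
    which keeps the count inside {0, 1, ..., size - 2}.
    """
    return sum(rng.random() < p for _ in range(size - 2))


def two_phase_sample(stratum_units, n_k, m_k, p1, p2, rng):
    """Return (first-phase respondents, second-phase respondents) for one stratum."""
    # Phase 1: SRSWOR of n_k units from the stratum.
    s_n = rng.sample(stratum_units, n_k)
    r1 = draw_nonresponse_count(n_k, p1, rng)
    resp1 = rng.sample(s_n, n_k - r1)  # the n_k - r_1k responding units

    # Phase 2: SRSWOR of m_k units from the first-phase respondents.
    if m_k > len(resp1):
        raise ValueError("second-phase size exceeds the number of first-phase respondents")
    s_m = rng.sample(resp1, m_k)
    r2 = draw_nonresponse_count(m_k, p2, rng)
    resp2 = rng.sample(s_m, m_k - r2)  # the m_k - r_2k responding units
    return resp1, resp2


rng = random.Random(1)
units = list(range(100))  # hypothetical unit labels for one stratum
first, second = two_phase_sample(units, n_k=30, m_k=10, p1=0.05, p2=0.05, rng=rng)
print(len(first), len(second))
```

Every second-phase respondent is also a first-phase respondent, which mirrors the nesting of $S_{m_k}$ inside the responding part of $S_{n_k}$.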

Notations
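From now on, we will use the following notations:
$S_Y^2$: The population variance of $Y$, i.e., the characteristic under study.
$S_{Y_k}^2$: The population mean square of the study variable $Y$ in the $k$th stratum.
$S_{X_k}^2$, $S_{Z_k}^2$: The population mean squares of the auxiliary variables $X$ and $Z$, respectively, in the $k$th stratum.
$s^{*2}_{x_{n_k}} = \frac{1}{n_k - r_{1k} - 1} \sum_{j=1}^{n_k - r_{1k}} (x_{kj} - \bar{x}_{n_k - r_{1k}})^2$: The sample mean square of the auxiliary variable $X$ for the $k$th stratum, based on the responding part of the sample $S_{n_k}$.
$s^{*2}_{x_{m_k}} = \frac{1}{m_k - r_{2k} - 1} \sum_{j=1}^{m_k - r_{2k}} (x_{kj} - \bar{x}_{m_k - r_{2k}})^2$: The sample mean square of the auxiliary variable $X$ for the $k$th stratum, based on the responding part of the sample $S_{m_k}$.
$s^{*2}_{y_{n_k}} = \frac{1}{n_k - r_{1k} - 1} \sum_{j=1}^{n_k - r_{1k}} (y_{kj} - \bar{y}_{n_k - r_{1k}})^2$: The sample mean square of the study variable $Y$ for the $k$th stratum, based on the responding part of the sample $S_{n_k}$.
$s^{*2}_{y_{m_k}} = \frac{1}{m_k - r_{2k} - 1} \sum_{j=1}^{m_k - r_{2k}} (y_{kj} - \bar{y}_{m_k - r_{2k}})^2$: The sample mean square of the study variable $Y$ for the $k$th stratum, based on the responding part of the sample $S_{m_k}$.
$s^{*2}_{z_{n_k}} = \frac{1}{n_k - r_{1k} - 1} \sum_{j=1}^{n_k - r_{1k}} (z_{kj} - \bar{z}_{n_k - r_{1k}})^2$: The sample mean square of the auxiliary variable $Z$ for the $k$th stratum, based on the responding part of the sample $S_{n_k}$.
$s^{*2}_{z_{m_k}} = \frac{1}{m_k - r_{2k} - 1} \sum_{j=1}^{m_k - r_{2k}} (z_{kj} - \bar{z}_{m_k - r_{2k}})^2$: The sample mean square of the auxiliary variable $Z$ for the $k$th stratum, based on the responding part of the sample $S_{m_k}$.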
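A sample mean square computed on the responding part of a sample can be illustrated as follows. The sketch assumes the usual divisor (number of respondents minus one), and the function name `respondent_mean_square` is hypothetical; non-respondents are represented by `None`, which is a convention chosen here for illustration only.

```python
def respondent_mean_square(values):
    """Sample mean square over responding units only.

    `values` holds one entry per sampled unit; None marks a non-respondent.
    Assumes the divisor (number of respondents - 1).
    """
    observed = [v for v in values if v is not None]
    n_resp = len(observed)
    if n_resp < 2:
        raise ValueError("at least two respondents are needed")
    mean = sum(observed) / n_resp
    return sum((v - mean) ** 2 for v in observed) / (n_resp - 1)


# Hypothetical first-phase sample with two non-respondents.
sample_y = [1.0, None, 2.0, 3.0, None, 4.0]
print(respondent_mean_square(sample_y))
```

The lower bound of two respondents matches the restriction that at most $n_k - 2$ (or $m_k - 2$) units fail to respond, so the mean square is always defined.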

Non-response probability model
The $k$th stratum is treated under the random non-response model proposed by Singh and Joarder (ref. 28). In the first phase, of the $n_k$ units sampled from the stratum, $n_k - r_{1k}$ respond and random non-response is observed on the remaining $r_{1k}$ units, where $r_{1k}$ may take any value in $\{0, 1, 2, \ldots, n_k - 2\}$. In the second phase, of the $m_k$ units drawn from the $n_k - r_{1k}$ respondents, $m_k - r_{2k}$ respond and $r_{2k}$ do not, where $r_{2k}$ takes values in $\{0, 1, 2, \ldots, m_k - 2\}$. Let $p_1$ and $p_2$ denote the probabilities of non-response among the $(n_k - 2)$ and $(m_k - 2)$ possible values of non-response in the samples $S_{n_k}$ and $S_{m_k}$, respectively. The total numbers of ways of obtaining $r_{1k}$ and $r_{2k}$ non-responses are $\binom{n_k - 2}{r_{1k}}$ and $\binom{m_k - 2}{r_{2k}}$. The discrete random variables $r_{1k}$ and $r_{2k}$ then have the probability distributions

$$P(r_{1k}) = \frac{n_k - r_{1k}}{n_k q_1 + 2 p_1} \binom{n_k - 2}{r_{1k}} p_1^{r_{1k}} q_1^{n_k - 2 - r_{1k}}, \quad r_{1k} = 0, 1, \ldots, n_k - 2,$$

$$P(r_{2k}) = \frac{m_k - r_{2k}}{m_k q_2 + 2 p_2} \binom{m_k - 2}{r_{2k}} p_2^{r_{2k}} q_2^{m_k - 2 - r_{2k}}, \quad r_{2k} = 0, 1, \ldots, m_k - 2,$$

where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$.
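As a numerical check on this kind of model, the sketch below evaluates a non-response distribution of the form commonly attributed to Singh and Joarder, $P(r) = \frac{n - r}{nq + 2p}\binom{n-2}{r} p^r q^{n-2-r}$ for $r = 0, \ldots, n - 2$. That this is the exact form intended here is an assumption, and the function name `nonresponse_pmf` is hypothetical. The normalizing constant follows from $\sum_r (n - r)\binom{n-2}{r} p^r q^{n-2-r} = n - (n-2)p = nq + 2p$.

```python
from math import comb


def nonresponse_pmf(size, p):
    """Probabilities P(r) for r = 0, ..., size - 2 (assumed Singh-Joarder-type form)."""
    if size < 2 or not 0.0 <= p < 1.0:
        raise ValueError("need size >= 2 and 0 <= p < 1")
    q = 1.0 - p
    norm = size * q + 2.0 * p  # makes the probabilities sum to one
    return [
        (size - r) / norm * comb(size - 2, r) * p**r * q ** (size - 2 - r)
        for r in range(size - 1)
    ]


pmf = nonresponse_pmf(size=20, p=0.1)  # hypothetical n_k = 20, p_1 = 0.1
print(sum(pmf))  # close to 1 up to floating-point error
```

The same function serves both phases: call it with $(n_k, p_1)$ for $r_{1k}$ and with $(m_k, p_2)$ for $r_{2k}$.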

Suggested estimator
A wide class of estimators of the population variance is proposed as follows, assuming that random non-response affects both the study variable $Y$ and the first auxiliary variable $X$:
[…]
where $h(s^{*2})$ is a class of estimators of $S_X^2$ based on information on $s^{*2}_{x_{n_k}}$ and $s^{2}$ […]. As we proceed, we will examine the composite class of estimators applicable to individual strata in two-phase sampling.
We assume that $g(s^{*2})$ meets the regularity conditions listed below:
• Regardless of the sample chosen, the function $g(s^{*2})$ takes values within a closed convex subspace of the four-dimensional real space $\mathbb{R}^4$ that includes the point […].
• The function $g(s^{*2})$ is continuous and bounded.
• The partial derivatives of $g(s^{*2})$ of the first, second, and third orders exist and are continuous and bounded in $\mathbb{R}^4$.
The class of estimators $T_k$ is extensive, as it contains any parametric function $g(s^{*2})$ that meets the stated regularity conditions and has $g$ […], where $b_1$, $b_2$, and $b_3$ are scalars.

Calibration technique for obtaining the optimum strata weights
The new calibration estimator of the population variance under stratified sampling is given by
$$T = \sum_{k=1}^{L} W_k^{*} T_k,$$
where […], $k = 1, 2, \ldots, L$, and $W_k^{*}$, $k \in \{1, 2, \ldots, L\}$, are the calibrated strata weights.
The calibrated weights are obtained by minimizing the chi-square-type distance function
$$\sum_{k=1}^{L} \frac{(W_k^{*} - W_k)^2}{Q_k W_k},$$
where $W_k$ are the original strata weights, subject to the following calibration requirements: […], where $c_{z_k} =$ […]. It is important to note that $Q_k > 0$ are suitably chosen weights that determine the form of the estimator.
Detailed derivations are given in Appendix A.
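The mechanics of chi-square calibration can be illustrated with a small sketch. The paper's actual calibration constraints are not reproduced here, so the code assumes a single hypothetical linear constraint $\sum_k W_k^{*} a_k = A$ and the standard chi-square distance $\sum_k (W_k^{*} - W_k)^2 / (Q_k W_k)$. Under those assumptions the Lagrange solution is $W_k^{*} = W_k + W_k Q_k a_k \, (A - \sum_j W_j a_j) / \sum_j W_j Q_j a_j^2$. The names `calibrate_weights`, `a`, and `target` are illustrative only.

```python
def calibrate_weights(W, a, target, Q=None):
    """Minimize sum((W*_k - W_k)^2 / (Q_k W_k)) subject to sum(W*_k a_k) = target.

    W      : original strata weights
    a      : stratum-level auxiliary quantities in the (hypothetical) constraint
    target : known value the calibrated weights must reproduce
    Q      : positive tuning weights Q_k (defaults to 1 for every stratum)
    """
    if Q is None:
        Q = [1.0] * len(W)
    current = sum(w * ak for w, ak in zip(W, a))
    denom = sum(w * qk * ak * ak for w, qk, ak in zip(W, Q, a))
    if denom == 0.0:
        raise ValueError("constraint coefficients are all zero")
    lam = (target - current) / denom  # Lagrange multiplier (up to a factor of 2)
    return [w + lam * w * qk * ak for w, qk, ak in zip(W, Q, a)]


W = [0.2, 0.3, 0.5]   # hypothetical original weights
a = [1.1, 0.9, 1.3]   # hypothetical auxiliary values
W_star = calibrate_weights(W, a, target=1.15)
print(W_star, sum(ws * ak for ws, ak in zip(W_star, a)))
```

With a single constraint the calibrated weights need not sum to one; adding a second constraint such as $\sum_k W_k^{*} = 1$ leads to a two-multiplier system solved in the same way. Different choices of $Q_k$ change how the adjustment is spread across strata, which is how they determine the form of the resulting estimator.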

Bias and mean square error of the suggested estimator
To analyze the properties of the estimator $T$ under large-sample assumptions, we use the following transformations: […] such that $|\epsilon_{ik}| \le 1$ for all $i = 0, 1, 2, 3$ and $E(\epsilon_{ik}) = 0$. The Bias(T) and the MSE(T) of the suggested estimator $T$, correct to the first order of approximation, are as follows: […] Detailed derivations are given in Appendix B.

Minimum mean square error of the suggested estimator under optimal conditions
We note from Eq. (5) that the derivatives $d_{2k}$ and $d_{4k}$ affect the MSE of the estimator $T$. To obtain their optimal values, we minimize the MSE with respect to them. Substituting the optimal values from Eqs. (6) and (7), respectively, into Eq. (5) gives the minimum MSE as follows: […]

Effect of measurement error
The actual and observed values of $Y$ and $X$ are denoted by $y_{kj}^{a}$, $x_{kj}^{a}$ and $y_{kj}^{o}$, $x_{kj}^{o}$, while $u_{kj}$ and $v_{kj}$ denote the corresponding measurement errors. Then $x_{kj}^{a} = x_{kj}^{o} + v_{kj}$ and $y_{kj}^{a} = y_{kj}^{o} + u_{kj}$, so that $V(y_{kj}^{a}) = V(y_{kj}^{o}) + V(u_{kj})$, with a zero covariance term because the errors are independent. This implies […]. Measurement errors are assumed to occur only on the study variable $Y$ and the first auxiliary variable $X$, not on the second auxiliary variable $Z$. The expression for Min.MSE under measurement error was determined as follows: […]
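The additive variance decomposition under independent measurement errors can be checked with a short simulation. This is a sketch on hypothetical values (means, standard deviations, and the sample size are arbitrary choices), following the section's convention that the actual value equals the observed value plus an independent error.

```python
import random
import statistics

rng = random.Random(42)
N = 20000  # hypothetical number of simulated units

# Hypothetical observed values and independent zero-mean measurement errors.
y_obs = [rng.gauss(50.0, 2.0) for _ in range(N)]
u_err = [rng.gauss(0.0, 1.0) for _ in range(N)]
y_act = [yo + u for yo, u in zip(y_obs, u_err)]  # y^a = y^o + u

var_obs = statistics.variance(y_obs)
var_err = statistics.variance(u_err)
var_act = statistics.variance(y_act)

# The gap equals twice the sample covariance of y_obs and u_err,
# which is close to zero because the two are generated independently.
gap = var_act - (var_obs + var_err)
print(var_obs, var_err, var_act, gap)
```

If the errors were correlated with the observed values, the covariance term would no longer vanish and the simple additive form would not hold.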

Numerical study
Before an estimator is used in practice, its performance must be evaluated. This section therefore reports an empirical investigation of the suggested estimator using both real and simulated data. We compare the suggested estimator $T$ with the contemporary estimator $\tau$ under random non-response. The estimator $\tau$ is defined as follows: […] We also compare these estimators with the standard estimator, since it is the only available option when dealing with non-response and measurement errors.
The expressions for its MSE, with and without measurement errors, respectively, are as follows: […] The Percentage Relative Efficiency (PRE) of the proposed estimator $T$ with respect to the estimator $\tau$ is given by
$$\mathrm{PRE} = \frac{\mathrm{MSE}(\tau)}{\mathrm{Min.MSE}(T)} \times 100,$$
where Eqs. (8)-(11) give the corresponding expressions for Min.MSE(T) and MSE($\tau$), without or with measurement errors, respectively.
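The PRE computation itself is a one-line ratio, sketched below. The function name and the MSE values are hypothetical and serve only to show how the figure is read: a PRE above 100 means the proposed estimator has the smaller mean square error.

```python
def percentage_relative_efficiency(mse_tau, min_mse_t):
    """PRE of T with respect to tau: MSE(tau) / Min.MSE(T) * 100."""
    if min_mse_t <= 0.0:
        raise ValueError("Min.MSE(T) must be positive")
    return mse_tau / min_mse_t * 100.0


# Hypothetical MSE values, for illustration only.
pre = percentage_relative_efficiency(mse_tau=2.0, min_mse_t=1.0)
print(pre)  # 200.0, i.e. tau has twice the MSE of T
```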

Study based on simulated data
We conducted a simulation study relevant to our theoretical findings using the statistical computing software R. We used the function mvrnorm from the MASS package, which draws from a multivariate normal distribution, to generate the study and auxiliary variables with given parameters and a given correlation coefficient. Data from other distributions, such as the Poisson, can be generated with the function genCorGen from the package simstudy. The measurement errors were generated from a univariate standard normal distribution with the function rnorm. Table 1 shows the population parameters for the generated data.
The resulting calibrated stratum weights are shown in Table 2, and the PREs in the presence and in the absence of non-response are shown in Tables 3 and 4, respectively.
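The data-generation step can be mimicked without R. The sketch below is a Python stand-in, not the authors' script: it draws a study variable and two auxiliary variables with a common pairwise correlation `rho` from a one-factor normal construction, and adds standard normal measurement errors. The function names, stratum size, and value of `rho` are hypothetical.

```python
import math
import random


def simulate_stratum(size, rho, seed=0):
    """Generate (y, x, z) with pairwise correlation rho (0 <= rho <= 1),
    plus standard normal measurement errors u (for y) and v (for x)."""
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    rows = []
    for _ in range(size):
        f = rng.gauss(0.0, 1.0)  # shared factor inducing the correlation
        y = a * f + b * rng.gauss(0.0, 1.0)
        x = a * f + b * rng.gauss(0.0, 1.0)
        z = a * f + b * rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)  # measurement error on y
        v = rng.gauss(0.0, 1.0)  # measurement error on x
        rows.append({"y": y, "x": x, "z": z, "u": u, "v": v})
    return rows


def correlation(a_vals, b_vals):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a_vals)
    ma, mb = sum(a_vals) / n, sum(b_vals) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a_vals, b_vals))
    va = sum((p - ma) ** 2 for p in a_vals)
    vb = sum((q - mb) ** 2 for q in b_vals)
    return cov / math.sqrt(va * vb)


rows = simulate_stratum(size=20000, rho=0.8, seed=1)
ys = [r["y"] for r in rows]
xs = [r["x"] for r in rows]
print(correlation(ys, xs))  # should be close to the chosen rho
```

The one-factor construction only covers non-negative, equal correlations; a general covariance matrix would need a Cholesky factor, which is what a multivariate normal generator provides.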

Study based on real data
This section demonstrates the practical application of the proposed class of estimators. The dataset is available in the UCI Machine Learning Repository under the title "Gas Turbine CO and NOx Emission Data Set." It comprises 36,733 instances of 11 sensor measurements from a gas turbine located in the northwestern region of Turkey, aggregated over one hour (by average or sum) for the analysis of CO and NOx (NO + NO2) flue gas emissions. The analysis uses the file gt_2011.csv. In real-world circumstances, the goal is to estimate the variance as precisely as possible, but complete data are often unavailable. We therefore consider the case where some data on the study variable are missing. The statistical characteristics of the population are detailed in Table 1, while the calibrated weights for the strata are listed in Table 5. The PRE (Percentage Relative Efficiency) in the presence and in the absence of non-response is presented in Tables 6 and 7, respectively.

Discussion
The detailed numerical study leads to the following key points:
1. The strata weights produced by the calibration procedure differ slightly from the actual ones, as is evident in Tables 2 and 5. Nevertheless, our findings indicate that the calibration technique improves the stratum weights, resulting in more accurate estimates.
2. Table 3 reveals a consistent pattern: when $p_1, p_2 \in (0.05, 0.1)$, the suggested estimator outperforms the existing estimator, regardless of the presence or absence of measurement errors. This observation is further supported by the real data presented in Table 6.
3. Further analysis of Tables 3 and 6 reveals that increasing $p_2$ while keeping $p_1$ constant results in a higher PRE. This observation is a significant outcome of our research. Additionally, when $p_2$ remains fixed and $p_1$ increases, the PRE decreases, in line with our expectations.
4. Tables 4 and 7 demonstrate that the proposed estimator also yields a higher Percentage Relative Efficiency (PRE) than the conventional estimator in the absence of non-response, which underscores the effectiveness of our method even without non-response.
5. As the correlation coefficient increases, the PRE also increases. Conversely, a decrease in the correlation coefficient leads to a decrease in the PRE.
The recommended estimator mitigates the adverse effects of random non-response and measurement errors in two-phase stratified sampling. The advantages are evident when additional information on two positively related variables is available. We anticipate the development of more estimators within the proposed class, allowing survey statisticians to provide even more precise estimates.

Conclusions
Our research makes several contributions with practical applications. The calibration technique improves the stratum weights, leading to more precise estimates, even though the calibrated weights deviate slightly from the actual weights. The proposed estimator consistently outperforms its counterparts within the parameter ranges studied, and it does so both with and without measurement errors. The higher Percentage Relative Efficiency (PRE) of the proposed estimator, even in scenarios without non-response, highlights its effectiveness in improving estimation accuracy. We have observed that the correlation coefficient and the values of $p_1$ and $p_2$ play significant roles in the performance of the estimator. The results obtained from simulated data are further supported by the analysis of real-world data on gas turbine exhaust pressure, confirming the applicability and reliability of the proposed methodology in practical scenarios.
Our study provides methods for improving population variance estimation in practical scenarios affected by non-response and measurement errors. The consistent performance of the proposed estimators supports their effectiveness and reliability in survey statistics. Moreover, incorporating neutrosophic statistics aligns with the need to address uncertainty and imprecision in survey data, further reinforcing the proposed methodology. The agreement between the results for simulated data and those for the real-world dataset supports the applicability and trustworthiness of the proposed methodology in practical, real-life scenarios.

Table 2. Calibrated strata weights for simulated data.

Table 3. PRE of $T$ w.r.t. $\tau$ for simulated Poisson data.

Table 4. PRE from simulated data in the absence of non-response, when $p_1 = p_2 = 0$.

Table 5. Calibrated strata weights for real data.

Table 6. PRE of $T$ w.r.t. $\tau$ for real data.

Table 7. PRE from real data in the absence of non-response, when $p_1 = p_2 = 0$.